Project - Ensemble Techniques

by HARI SAMYNAATH S

Part ONE

Domain: Telecom

Context:
A telecom company wants to use its historical customer data to predict customer behaviour and retain customers. The aim is to analyse all relevant customer data and develop focused customer retention programs.

Data Description:
Each row represents a customer and each column contains a customer attribute, as described in the column metadata. The data set includes information about:
● Customers who left within the last month – the column is called Churn
● Services that each customer has signed up for – phone, multiple lines, internet, online security, online backup, device protection, tech support, and streaming TV and movies
● Customer account information – how long they’ve been a customer, contract, payment method, paperless billing, monthly charges, and total charges
● Demographic info about customers – gender, age range, and if they have partners and dependents

Project Objective:
Build a model that will help to identify potential customers who have a higher probability of churn. This helps the company understand the pain points and patterns of customer churn, and sharpens the focus on strategizing customer retention.

Steps and Tasks:
1. Data Understanding and Exploration:
a. Read ‘TelcomCustomer-Churn_1.csv’ as a DataFrame and assign it to a variable.
b. Read ‘TelcomCustomer-Churn_2.csv’ as a DataFrame and assign it to a variable.
c. Merge both the DataFrames on key ‘customerID’ to form a single DataFrame.
d. Verify that all the columns are incorporated in the merged DataFrame by using a simple comparison operator in Python.

Both datasets contain different attributes about the customers.
Let's check the customerID column for commonality of records in both datasets.

Both datasets have 7043 unique records.
Let's check whether they are about the same set of customers.

The customerID values of both datasets match each other in the same sequence,
so a very simple merge will be sufficient.
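The checks and merge above can be sketched with pandas. The toy frames below stand in for the two CSV files (in the project they would come from `pd.read_csv`):

```python
import pandas as pd

# Toy stand-ins for TelcomCustomer-Churn_1.csv and TelcomCustomer-Churn_2.csv
df1 = pd.DataFrame({"customerID": ["0001-A", "0002-B", "0003-C"],
                    "gender": ["Male", "Female", "Male"],
                    "tenure": [1, 34, 2]})
df2 = pd.DataFrame({"customerID": ["0001-A", "0002-B", "0003-C"],
                    "MonthlyCharges": [29.85, 56.95, 53.85],
                    "Churn": ["No", "No", "Yes"]})

# Commonality check: same number of unique IDs, and the same sequence
assert df1["customerID"].nunique() == df2["customerID"].nunique()
assert (df1["customerID"] == df2["customerID"]).all()

# A simple merge on the key is therefore sufficient
df = df1.merge(df2, on="customerID")

# Verify all columns are incorporated, using a simple comparison
assert set(df.columns) == set(df1.columns) | set(df2.columns)
print(df.shape)  # (3, 5)
```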

2. Data Cleaning and Analysis:
a. Impute missing/unexpected values in the DataFrame.

Since the DataFrame does not contain any null values (technically),
let's review the contents of each column to understand the data better.

The attributes have varying datatypes.
Let's look in detail at the unique values and their distribution in each column.

Almost every feature has well-defined data, with two exceptions:

  1. Instances like "No phone service" or "No internet service" are assumed to be directly associated with the "No" class of the PhoneService and InternetService attributes respectively.
  2. The 11 blank entries in TotalCharges are assumed to be associated with 0-tenure customers.

Let us confirm both assumptions below using table filters.
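A minimal sketch of such a filter check on a toy frame (column names follow the dataset's metadata):

```python
import pandas as pd

# Toy subset of the churn data
df = pd.DataFrame({
    "PhoneService":    ["No", "Yes", "Yes"],
    "MultipleLines":   ["No phone service", "No", "Yes"],
    "InternetService": ["DSL", "No", "Fiber optic"],
    "OnlineSecurity":  ["Yes", "No internet service", "No"],
})

# Every "No phone service" row should carry PhoneService == "No"
phone_mask = df["MultipleLines"] == "No phone service"
assert (df.loc[phone_mask, "PhoneService"] == "No").all()

# Likewise, internet-dependent attributes with "No internet service"
# should carry InternetService == "No"
net_mask = df["OnlineSecurity"] == "No internet service"
assert (df.loc[net_mask, "InternetService"] == "No").all()

print("no anomalies found")
```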

Clearly, all the records with "No phone service" are associated with the "No" class of the PhoneService attribute;
hence they are not an anomaly.

Let's check the attributes containing "No internet service" records.

Similarly, all the records with "No internet service" are associated with the "No" class of the InternetService attribute;
hence they too are not an anomaly.

Let's review the blank fields of TotalCharges.

The 11 blanks in the TotalCharges column are associated with 0-tenure records.
These can aptly be imputed with 0.

2. Data Cleaning and Analysis:
b. Make sure all the variables with continuous values are of ‘Float’ type.
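The imputation and the type fix can be sketched together: `pd.to_numeric(..., errors="coerce")` turns the blank strings into NaN, which are confirmed to belong to 0-tenure rows and then imputed with 0 (toy data below):

```python
import pandas as pd

df = pd.DataFrame({
    "tenure": [0, 12, 0],
    "TotalCharges": [" ", "845.5", " "],   # blanks as in the raw file
    "MonthlyCharges": [70.35, 70.45, 20.05],
})

# Coerce blanks to NaN, then confirm they line up with 0-tenure customers
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")
assert (df.loc[df["TotalCharges"].isna(), "tenure"] == 0).all()

# Impute with 0 and make sure the continuous columns are floats
df["TotalCharges"] = df["TotalCharges"].fillna(0).astype(float)
df["MonthlyCharges"] = df["MonthlyCharges"].astype(float)
print(df.dtypes["TotalCharges"])  # float64
```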

2. Data Cleaning and Analysis:
c. Create a function that will accept a DataFrame as input and return pie-charts for all the appropriate Categorical features. Clearly show percentage distribution in the pie-chart.
d. Share insights for Q2.c.

Before creating such a function, let us temporarily convert the SeniorCitizen column to object type so that it is picked up as categorical;
later, all the object columns shall be encoded appropriately for model building.
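One possible shape for such a function is sketched below; the function name and the `max_levels` cutoff for "appropriate" categorical features are assumptions:

```python
import matplotlib
matplotlib.use("Agg")          # non-interactive backend for scripting
import matplotlib.pyplot as plt
import pandas as pd

def plot_categorical_pies(df, max_levels=10):
    """Draw a pie chart with percentage labels for every object column
    that has a manageable number of levels; return the columns plotted."""
    cat_cols = [c for c in df.select_dtypes(include="object").columns
                if df[c].nunique() <= max_levels]
    for col in cat_cols:
        counts = df[col].value_counts()
        fig, ax = plt.subplots()
        ax.pie(counts, labels=counts.index, autopct="%1.1f%%")
        ax.set_title(col)
        plt.close(fig)  # close instead of show, for a script setting
    return cat_cols

demo = pd.DataFrame({"SeniorCitizen": ["0", "0", "1"],
                     "Churn": ["No", "Yes", "No"],
                     "tenure": [1, 2, 3]})
print(plot_categorical_pies(demo))  # ['SeniorCitizen', 'Churn']
```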

The pie charts above show the percentage distribution of each categorical feature.

Let us study these patterns alongside the ML tools shortly.

2. Data Cleaning and Analysis:
e. Encode all the appropriate Categorical features with the best suitable approach.
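One common approach, shown as a sketch on toy rows: one-hot encode the nominal features with `pd.get_dummies` and map the binary target to 0/1.

```python
import pandas as pd

df = pd.DataFrame({"gender": ["Male", "Female"],
                   "Contract": ["Month-to-month", "Two year"],
                   "Churn": ["No", "Yes"]})

# One-hot encode the nominal features (drop_first avoids redundant columns);
# map the binary target directly to 0/1
X = pd.get_dummies(df.drop(columns="Churn"), drop_first=True)
y = df["Churn"].map({"No": 0, "Yes": 1})

print(list(X.columns))  # ['gender_Male', 'Contract_Two year']
print(y.tolist())       # [0, 1]
```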

2. Data Cleaning and Analysis:
f. Split the data into 80% train and 20% test.
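A minimal sketch of the 80/20 split; the `stratify` argument is an added assumption here, used so the imbalanced Churn ratio is preserved in both halves:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(50, 2)
y = np.array([0] * 40 + [1] * 10)       # imbalanced, like Churn

# 80/20 split, stratified so both halves keep the same class ratio
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

print(X_train.shape, X_test.shape)  # (40, 2) (10, 2)
```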

2. Data Cleaning and Analysis:
g. Normalize/Standardize the data with the best suitable approach.

Since the data ranges vary widely, scaling is necessary.

StandardScaler is used because this z-score based method preserves the inherent spread and outliers of the dataset, rather than bounding the values to fixed limits as min-max normalisation does.
Such behaviour helps with an imbalanced dataset.
The same can be observed in the scaled Y_train values: the lower-proportion "Yes" class is weighted with a larger magnitude than "No".

When displaying the results of the classification model, inverse_transform must be applied to the Y values.
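The fit/transform discipline above can be sketched as follows: the scaler is fitted on the training data only, the same statistics are applied to the test data, and `inverse_transform` recovers the original units for display.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]])
X_test = np.array([[2.0, 300.0]])

# Fit on the training data only, then apply the same statistics to test
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# Training columns now have zero mean and unit variance
assert np.allclose(X_train_s.mean(axis=0), 0)
assert np.allclose(X_train_s.std(axis=0), 1)

# inverse_transform recovers the original units when results are displayed
assert np.allclose(scaler.inverse_transform(X_train_s), X_train)
```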

3. Model building and Improvement:
a. Train a model using XGBoost and use RandomizedSearchCV to train on best parameters.
Also print best performing parameters along with train and test performance.
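A compressed sketch of the search, on synthetic imbalanced data; `GradientBoostingClassifier` is used here as a stand-in estimator in case xgboost is unavailable, and `XGBClassifier` drops into the same slot identically:

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic imbalanced data standing in for the churn features
X, y = make_classification(n_samples=300, weights=[0.73], random_state=42)

# Randomized search samples n_iter parameter combinations from the
# distributions, scoring each by recall (the project's chosen metric)
search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_distributions={"n_estimators": randint(50, 200),
                         "max_depth": randint(2, 6),
                         "learning_rate": [0.01, 0.1, 0.3]},
    n_iter=5, scoring="recall", cv=3, random_state=42)
search.fit(X, y)

print(search.best_params_)               # best performing parameters
print(round(search.best_score_, 3))      # best cross-validated recall
```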

Based on the scoreLog table, the tuned model's performance has clearly increased in terms of recall score (recall was the scoring parameter for the search).
Let's also print the best parameters below.

3. Model building and Improvement:
b. Train a model using XGBoost and use GridSearchCV to train on best parameters.
Also print best performing parameters along with train and test performance.
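GridSearchCV differs from the randomized search only in exhausting every combination of a fixed grid; a sketch on the same synthetic data (stand-in estimator as before, with `XGBClassifier` slotting in identically when xgboost is available):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=300, weights=[0.73], random_state=42)

# Exhaustive search over a small grid: 2 * 2 * 2 = 8 combinations
grid = GridSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_grid={"n_estimators": [50, 100],
                "max_depth": [2, 3],
                "learning_rate": [0.1, 0.3]},
    scoring="recall", cv=3)
grid.fit(X, y)

print(grid.best_params_)  # the exhaustive winner over the 8 combos
```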

The scoreLog table above shows the complete list of performance scores for all 3 models, on both the training and test data sets.
RandomizedSearchCV clearly gets closer to the best parameters than GridSearchCV.
On the test dataset, RandomizedSearchCV tuning achieved a Churn_Yes recall of 0.55 and an accuracy of 0.81,
improving from 0.35 and 0.79 respectively for the base XGBoost model.

======================================================================================================


Part TWO

• DOMAIN: IT

• CONTEXT: The purpose is to build a machine learning pipeline that will work autonomously irrespective of the data, so that users can save the effort involved in building pipelines for each dataset.

• PROJECT OBJECTIVE: Build a machine learning pipeline that will run autonomously on a given CSV file and return the best performing model.

• STEPS AND TASKS

  1. Build a simple ML pipeline which will accept a single ‘.csv’ file as input and return a trained base model that can be used for predictions. You can use 1 Dataset from Part 1 (single/merged).
  2. Create separate functions for various purposes.
  3. Various base models should be trained to select the best performing model.
  4. Pickle file should be saved for the best performing model.

Include best coding practices in the code:
• Modularization
• Maintainability
• Well commented code etc.
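A compressed sketch of such a pipeline, with synthetic data standing in for the CSV; the function names, the two base models, and the pickle path are illustrative assumptions:

```python
import pickle
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def load_data(path=None):
    """Read the csv at `path`; synthetic data is used here for the sketch."""
    if path is not None:
        return pd.read_csv(path)
    X, y = make_classification(n_samples=200, weights=[0.7], random_state=0)
    df = pd.DataFrame(X)
    df["target"] = y
    return df

def select_best_model(df, target="target"):
    """Rank base models by 10-fold CV recall; return the best one, refitted."""
    X, y = df.drop(columns=target), df[target]
    models = {"logreg": LogisticRegression(max_iter=1000),
              "tree": DecisionTreeClassifier(random_state=0)}
    scores = {name: cross_val_score(m, X, y, cv=10, scoring="recall").mean()
              for name, m in models.items()}
    best = max(scores, key=scores.get)
    return models[best].fit(X, y), best, scores

def save_model(model, path="best_model.pkl"):
    """Pickle the winning model for production use."""
    with open(path, "wb") as f:
        pickle.dump(model, f)

model, name, scores = select_best_model(load_data())
save_model(model)
print(name, {k: round(v, 3) for k, v in scores.items()})
```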

An ML pipeline was established.
The best model was identified and pickled for production readiness.
Test data was treated as equivalent to production data, with no data leaks.
The model was chosen based on the recall score for the Churn=Yes class.
Interestingly, the accuracy of the model is poor while recall is at its peak.
This is not a chance result, as a 10-fold cross-validation score was referenced for ranking the models;
yet, further evaluation methods could be implemented in future to build a more trustworthy model.

Follow-up actions needed:

  1. Visualisations were not incorporated except for the end results; to be improved.
  2. Dependent packages were loaded using global variables from within functions, which is not an effective style; need to explore options for extending the defined classes with the necessary packages.
  3. OOP was poorly executed and inheritance was not established, resulting in multiple fit/transform methods being used; needs improvement.